A tourism company named "Visit with Us" wants to establish a viable business model to expand its customer base.
A viable business model is a central concept here: it helps us understand the existing ways of doing business and how to change them for the benefit of the tourism sector.
One way to expand the customer base is to introduce a new package offering.
Currently the company offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that only 18% of the customers purchased a package.
In the last campaign the company contacted customers at random, without using the available information. This time the company plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and support or increase their sense of well-being. The company wants to harness the available data on existing and potential customers to make its marketing expenditure more efficient.
This dataset contains the 'Visit with Us' customer data.
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# To build sklearn model
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import f1_score,accuracy_score, recall_score, precision_score, roc_auc_score, roc_curve, confusion_matrix, precision_recall_curve
# To build Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To build bagging classifier and Random Forest model
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# To build boosting models
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
#To install xgboost library use - !pip install xgboost
from xgboost import XGBClassifier
#Import the data source Excel file as a dataframe
data = pd.read_excel('Tourism.xlsx', sheet_name = 'Tourism')
#Make a copy to avoid any changes to the original data
Tour_data = data.copy()
#Print the first five rows of the dataset
print(Tour_data.head())
#Print the last five rows of the dataset
print(Tour_data.tail())
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch \
0 200000 1 41.0 Self Enquiry 3 6.0
1 200001 0 49.0 Company Invited 1 14.0
2 200002 1 37.0 Self Enquiry 1 8.0
3 200003 0 33.0 Company Invited 1 9.0
4 200004 0 NaN Self Enquiry 1 8.0
Occupation Gender NumberOfPersonVisiting NumberOfFollowups \
0 Salaried Female 3 3.0
1 Salaried Male 3 4.0
2 Free Lancer Male 3 4.0
3 Salaried Female 2 3.0
4 Small Business Male 2 3.0
ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips \
0 Deluxe 3.0 Single 1.0
1 Deluxe 4.0 Divorced 2.0
2 Basic 3.0 Single 7.0
3 Basic 3.0 Divorced 2.0
4 Basic 4.0 Divorced 1.0
Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting \
0 1 2 1 0.0
1 0 3 1 2.0
2 1 3 0 0.0
3 1 5 1 1.0
4 0 5 1 0.0
Designation MonthlyIncome
0 Manager 20993.0
1 Manager 20130.0
2 Executive 17090.0
3 Executive 17909.0
4 Executive 18468.0
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch \
4883 204883 1 49.0 Self Enquiry 3 9.0
4884 204884 1 28.0 Company Invited 1 31.0
4885 204885 1 52.0 Self Enquiry 3 17.0
4886 204886 1 19.0 Self Enquiry 3 16.0
4887 204887 1 36.0 Self Enquiry 1 14.0
Occupation Gender NumberOfPersonVisiting NumberOfFollowups \
4883 Small Business Male 3 5.0
4884 Salaried Male 4 5.0
4885 Salaried Female 4 4.0
4886 Small Business Male 3 4.0
4887 Salaried Male 4 4.0
ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips \
4883 Deluxe 4.0 Unmarried 2.0
4884 Basic 3.0 Single 3.0
4885 Standard 4.0 Married 7.0
4886 Basic 3.0 Single 3.0
4887 Basic 4.0 Unmarried 3.0
Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting \
4883 1 1 1 1.0
4884 1 3 1 2.0
4885 0 1 1 3.0
4886 0 5 0 2.0
4887 1 3 1 2.0
Designation MonthlyIncome
4883 Manager 26576.0
4884 Executive 21212.0
4885 Senior Manager 31820.0
4886 Executive 20289.0
4887 Executive 24041.0
#Print the number of rows and columns in dataset
print (Tour_data.shape)
(4888, 20)
#Check for null values
print (Tour_data.isna().sum())
CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64
#Check for duplicates
print(Tour_data.duplicated().sum())
0
#Check the column datatypes
print(Tour_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CustomerID                4888 non-null   int64
 1   ProdTaken                 4888 non-null   int64
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object
 4   CityTier                  4888 non-null   int64
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object
 7   Gender                    4888 non-null   object
 8   NumberOfPersonVisiting    4888 non-null   int64
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64
 15  PitchSatisfactionScore    4888 non-null   int64
 16  OwnCar                    4888 non-null   int64
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
None
We see that the CustomerID column has no statistical value, so we will drop it as part of cleanup.
The dependent variable is ProdTaken, which is of numeric data type.
Some variables, such as OwnCar, PitchSatisfactionScore, Passport, and PreferredPropertyStar, are stored as numeric types; we will convert them to categorical shortly.
All of the object data types will be converted to categorical data types.
There are missing values in the MonthlyIncome, NumberOfChildrenVisiting, NumberOfTrips, PreferredPropertyStar, NumberOfFollowups, DurationOfPitch, TypeofContact, and Age columns of the dataset.
Tour_data.drop(['CustomerID'],axis=1,inplace=True)
#Check the summary of dataset
print(Tour_data.describe().T)
count mean std min 25% \
ProdTaken 4888.0 0.188216 0.390925 0.0 0.0
Age 4662.0 37.622265 9.316387 18.0 31.0
CityTier 4888.0 1.654255 0.916583 1.0 1.0
DurationOfPitch 4637.0 15.490835 8.519643 5.0 9.0
NumberOfPersonVisiting 4888.0 2.905074 0.724891 1.0 2.0
NumberOfFollowups 4843.0 3.708445 1.002509 1.0 3.0
PreferredPropertyStar 4862.0 3.581037 0.798009 3.0 3.0
NumberOfTrips 4748.0 3.236521 1.849019 1.0 2.0
Passport 4888.0 0.290917 0.454232 0.0 0.0
PitchSatisfactionScore 4888.0 3.078151 1.365792 1.0 2.0
OwnCar 4888.0 0.620295 0.485363 0.0 0.0
NumberOfChildrenVisiting 4822.0 1.187267 0.857861 0.0 1.0
MonthlyIncome 4655.0 23619.853491 5380.698361 1000.0 20346.0
50% 75% max
ProdTaken 0.0 0.0 1.0
Age 36.0 44.0 61.0
CityTier 1.0 3.0 3.0
DurationOfPitch 13.0 20.0 127.0
NumberOfPersonVisiting 3.0 3.0 5.0
NumberOfFollowups 4.0 4.0 6.0
PreferredPropertyStar 3.0 4.0 5.0
NumberOfTrips 3.0 4.0 22.0
Passport 0.0 1.0 1.0
PitchSatisfactionScore 3.0 4.0 5.0
OwnCar 1.0 1.0 1.0
NumberOfChildrenVisiting 1.0 2.0 3.0
MonthlyIncome 22347.0 25571.0 98678.0
ProdTaken: The 75th percentile is still 0, confirming that only about 18% of customers purchased a tour package last year.
Age: The average age of the customers is 37 years, with a wide range from 18 to 61 years.
CityTier: At least 50% of the customers are from Tier 1 cities.
DurationOfPitch: On average the pitch lasts around 15 minutes. The large gap between the 75th percentile (20 minutes) and the maximum (127 minutes) indicates that there might be outliers in this variable.
NumberOfPersonVisiting: On average a group of about 3 people travels on a tour package.
NumberOfFollowups: The average number of sales-team follow-ups with a customer is about 4.
PreferredPropertyStar: The median preference is a 3-star property; the 75th percentile is 4 stars and the maximum is 5 stars.
NumberOfTrips: The average number of trips per customer is 3. The large gap between the 75th percentile (4 trips) and the maximum (22 trips) indicates that there might be outliers in this variable.
Passport: About 71% of the customers do not have a passport (mean 0.29).
OwnCar: About 38% of the customers do not own a car (mean 0.62).
PitchSatisfactionScore: The mean satisfaction score is about 3.
NumberOfChildrenVisiting: At least 25% of the customers bring no children; the median is one child.
MonthlyIncome: The average income of the customers is around 23k USD. The gap between the minimum (1k) and the 25th percentile (20k), and the large gap between the 75th percentile (25.5k) and the maximum (98.7k), indicate that there might be outliers in this variable.
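Several of the observations above flag possible outliers from the gap between the 75th percentile and the maximum. The standard 1.5×IQR (Tukey) rule quantifies this; a minimal sketch on hypothetical pitch-duration values (not the actual dataset):

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside the 1.5*IQR whiskers (Tukey's rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical pitch durations: mostly 6-30 minutes, two extreme values
durations = pd.Series([6, 8, 9, 13, 14, 15, 20, 22, 30, 126, 127])
print(iqr_outlier_count(durations))  # → 2
```

The same function could be applied to DurationOfPitch, NumberOfTrips, and MonthlyIncome to get a concrete outlier count per variable.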
# Store the dataframe's column names to iterate over
num_columns = Tour_data.describe(include = 'all').columns
num_columns
Index(['ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch',
'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups',
'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
dtype='object')
for i in num_columns:
    print('Unique values in', i, 'are :')
    print(Tour_data[i].value_counts())
    print('*' * 50)
Unique values in ProdTaken are :
0 3968
1 920
Name: ProdTaken, dtype: int64
**************************************************
Unique values in Age are :
35.0 237
36.0 231
34.0 211
31.0 203
30.0 199
32.0 197
33.0 189
37.0 185
29.0 178
38.0 176
41.0 155
39.0 150
28.0 147
40.0 146
42.0 142
27.0 138
43.0 130
46.0 121
45.0 116
26.0 106
44.0 105
51.0 90
47.0 88
50.0 86
25.0 74
52.0 68
53.0 66
48.0 65
49.0 65
55.0 64
54.0 61
56.0 58
24.0 56
22.0 46
23.0 46
59.0 44
21.0 41
20.0 38
19.0 32
58.0 31
60.0 29
57.0 29
18.0 14
61.0 9
Name: Age, dtype: int64
**************************************************
Unique values in TypeofContact are :
Self Enquiry 3444
Company Invited 1419
Name: TypeofContact, dtype: int64
**************************************************
Unique values in CityTier are :
1 3190
3 1500
2 198
Name: CityTier, dtype: int64
**************************************************
Unique values in DurationOfPitch are :
9.0 483
7.0 342
8.0 333
6.0 307
16.0 274
15.0 269
14.0 253
10.0 244
13.0 223
11.0 205
12.0 195
17.0 172
30.0 95
22.0 89
31.0 83
23.0 79
18.0 75
32.0 74
29.0 74
21.0 73
25.0 73
27.0 72
26.0 72
24.0 70
35.0 66
20.0 65
28.0 61
33.0 57
19.0 57
34.0 50
36.0 44
5.0 6
126.0 1
127.0 1
Name: DurationOfPitch, dtype: int64
**************************************************
Unique values in Occupation are :
Salaried 2368
Small Business 2084
Large Business 434
Free Lancer 2
Name: Occupation, dtype: int64
**************************************************
Unique values in Gender are :
Male 2916
Female 1817
Fe Male 155
Name: Gender, dtype: int64
**************************************************
Unique values in NumberOfPersonVisiting are :
3 2402
2 1418
4 1026
1 39
5 3
Name: NumberOfPersonVisiting, dtype: int64
**************************************************
Unique values in NumberOfFollowups are :
4.0 2068
3.0 1466
5.0 768
2.0 229
1.0 176
6.0 136
Name: NumberOfFollowups, dtype: int64
**************************************************
Unique values in ProductPitched are :
Basic 1842
Deluxe 1732
Standard 742
Super Deluxe 342
King 230
Name: ProductPitched, dtype: int64
**************************************************
Unique values in PreferredPropertyStar are :
3.0 2993
5.0 956
4.0 913
Name: PreferredPropertyStar, dtype: int64
**************************************************
Unique values in MaritalStatus are :
Married 2340
Divorced 950
Single 916
Unmarried 682
Name: MaritalStatus, dtype: int64
**************************************************
Unique values in NumberOfTrips are :
2.0 1464
3.0 1079
1.0 620
4.0 478
5.0 458
6.0 322
7.0 218
8.0 105
21.0 1
19.0 1
22.0 1
20.0 1
Name: NumberOfTrips, dtype: int64
**************************************************
Unique values in Passport are :
0 3466
1 1422
Name: Passport, dtype: int64
**************************************************
Unique values in PitchSatisfactionScore are :
3 1478
5 970
1 942
4 912
2 586
Name: PitchSatisfactionScore, dtype: int64
**************************************************
Unique values in OwnCar are :
1 3032
0 1856
Name: OwnCar, dtype: int64
**************************************************
Unique values in NumberOfChildrenVisiting are :
1.0 2080
2.0 1335
0.0 1082
3.0 325
Name: NumberOfChildrenVisiting, dtype: int64
**************************************************
Unique values in Designation are :
Executive 1842
Manager 1732
Senior Manager 742
AVP 342
VP 230
Name: Designation, dtype: int64
**************************************************
Unique values in MonthlyIncome are :
21020.0 7
20855.0 7
17342.0 7
21288.0 7
17741.0 6
..
24924.0 1
19507.0 1
21108.0 1
20953.0 1
1000.0 1
Name: MonthlyIncome, Length: 2475, dtype: int64
**************************************************
num_missing = Tour_data.isnull().sum(axis=1)
num_missing.value_counts()
0    4128
1     533
2     202
3      25
dtype: int64
Tour_data[num_missing == 2]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 0 | NaN | Self Enquiry | 1 | 21.0 | Salaried | Female | 2 | 4.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 3 | 0 | 0.0 | Manager | NaN |
| 19 | 0 | NaN | Self Enquiry | 1 | 8.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Single | 6.0 | 1 | 4 | 0 | 1.0 | Executive | NaN |
| 20 | 0 | NaN | Company Invited | 1 | 17.0 | Salaried | Female | 3 | 2.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 1 | 2.0 | Manager | NaN |
| 26 | 1 | NaN | Company Invited | 1 | 22.0 | Salaried | Female | 3 | 5.0 | Basic | 5.0 | Single | 2.0 | 1 | 4 | 1 | 2.0 | Executive | NaN |
| 44 | 0 | NaN | Company Invited | 1 | 6.0 | Small Business | Female | 2 | 3.0 | Deluxe | 3.0 | Single | 2.0 | 0 | 3 | 1 | 0.0 | Manager | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2390 | 1 | 34.0 | Company Invited | 3 | NaN | Salaried | Female | 2 | 5.0 | Basic | 3.0 | Single | 2.0 | 0 | 3 | 0 | 1.0 | Executive | NaN |
| 2399 | 1 | NaN | Company Invited | 3 | 19.0 | Large Business | Female | 2 | 3.0 | Deluxe | 4.0 | Single | 6.0 | 0 | 3 | 1 | 0.0 | Manager | NaN |
| 2410 | 1 | NaN | Self Enquiry | 1 | 30.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Married | 2.0 | 1 | 1 | 0 | 0.0 | Executive | NaN |
| 2430 | 1 | NaN | Self Enquiry | 1 | 14.0 | Small Business | Female | 3 | 3.0 | Basic | 5.0 | Married | 2.0 | 1 | 3 | 0 | 2.0 | Executive | NaN |
| 2431 | 1 | 35.0 | Company Invited | 1 | NaN | Small Business | Male | 3 | 3.0 | Basic | 4.0 | Married | 2.0 | 1 | 3 | 1 | 0.0 | Executive | NaN |
202 rows × 19 columns
for n in num_missing.value_counts().sort_index().index:
    if n > 0:
        print(f'For the rows with exactly {n} missing values, NaN are found in:')
        n_miss_per_col = Tour_data[num_missing == n].isnull().sum()
        print(n_miss_per_col[n_miss_per_col > 0])
        print('\n\n')
For the rows with exactly 1 missing values, NaN are found in:
Age                          96
DurationOfPitch             154
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
NumberOfChildrenVisiting     66
MonthlyIncome                 6
dtype: int64

For the rows with exactly 2 missing values, NaN are found in:
Age                130
DurationOfPitch     72
MonthlyIncome      202
dtype: int64

For the rows with exactly 3 missing values, NaN are found in:
TypeofContact      25
DurationOfPitch    25
MonthlyIncome      25
dtype: int64
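The output above suggests missingness is not independent across columns: the 25 rows with exactly 3 missing values all lack TypeofContact, DurationOfPitch, and MonthlyIncome together. One way to check such co-occurrence in general is to correlate the null-indicator matrix; a sketch on a small hypothetical frame (columns `a` and `b` are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame where columns a and b are always missing together
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 2.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Correlation of the missing-value indicators: 1.0 means the two
# columns are always missing in the same rows
miss_corr = df.isnull().astype(int).corr()
print(miss_corr.loc["a", "b"])  # → 1.0
```

Running the same on the tourism data would confirm whether those three columns share a common missingness pattern, which can inform whether rows should be imputed or dropped together.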
#Print the data info
Tour_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4662 non-null   float64
 2   TypeofContact             4863 non-null   object
 3   CityTier                  4888 non-null   int64
 4   DurationOfPitch           4637 non-null   float64
 5   Occupation                4888 non-null   object
 6   Gender                    4888 non-null   object
 7   NumberOfPersonVisiting    4888 non-null   int64
 8   NumberOfFollowups         4843 non-null   float64
 9   ProductPitched            4888 non-null   object
 10  PreferredPropertyStar     4862 non-null   float64
 11  MaritalStatus             4888 non-null   object
 12  NumberOfTrips             4748 non-null   float64
 13  Passport                  4888 non-null   int64
 14  PitchSatisfactionScore    4888 non-null   int64
 15  OwnCar                    4888 non-null   int64
 16  NumberOfChildrenVisiting  4822 non-null   float64
 17  Designation               4888 non-null   object
 18  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 725.7+ KB
#Check missing data count in Age column
Tour_data[Tour_data.Age.isnull()].count()
ProdTaken                   226
Age                           0
TypeofContact               226
CityTier                    226
DurationOfPitch             226
Occupation                  226
Gender                      226
NumberOfPersonVisiting      226
NumberOfFollowups           226
ProductPitched              226
PreferredPropertyStar       226
MaritalStatus               226
NumberOfTrips               226
Passport                    226
PitchSatisfactionScore      226
OwnCar                      226
NumberOfChildrenVisiting    226
Designation                 226
MonthlyIncome                96
dtype: int64
#Check the median value of Age column
Tour_data['Age'].median()
36.0
#Replace null values with median value for the Age column
Tour_data['Age'] = Tour_data['Age'].fillna(Tour_data['Age'].median())
#Check for null data in Age column if any
Tour_data[Tour_data.Age.isnull()]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 19 columns
#Check missing data count in MonthlyIncome column
Tour_data[Tour_data.MonthlyIncome.isnull()].count()
ProdTaken                   233
Age                         233
TypeofContact               208
CityTier                    233
DurationOfPitch             136
Occupation                  233
Gender                      233
NumberOfPersonVisiting      233
NumberOfFollowups           233
ProductPitched              233
PreferredPropertyStar       233
MaritalStatus               233
NumberOfTrips               233
Passport                    233
PitchSatisfactionScore      233
OwnCar                      233
NumberOfChildrenVisiting    233
Designation                 233
MonthlyIncome                 0
dtype: int64
#Check the median value of MonthlyIncome column
Tour_data['MonthlyIncome'].median()
22347.0
#Replace null values with median value for the MonthlyIncome column
Tour_data['MonthlyIncome'] = Tour_data['MonthlyIncome'].fillna(Tour_data['MonthlyIncome'].median())
#Check for null data in MonthlyIncome column if any
Tour_data[Tour_data.MonthlyIncome.isnull()]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 19 columns
#Replace null values with Unknown value for the TypeofContact column
Tour_data['TypeofContact'] = Tour_data['TypeofContact'].fillna('Unknown')
#Replace null values with median value for the DurationOfPitch column
Tour_data['DurationOfPitch'] = Tour_data['DurationOfPitch'].fillna(Tour_data['DurationOfPitch'].median())
#Replace null values with median value for the NumberOfFollowups column
Tour_data['NumberOfFollowups'] = Tour_data['NumberOfFollowups'].fillna(Tour_data['NumberOfFollowups'].median())
#Replace null values with median value for the PreferredPropertyStar column
Tour_data['PreferredPropertyStar'] = Tour_data['PreferredPropertyStar'].fillna(Tour_data['PreferredPropertyStar'].median())
#Replace null values with median value for the NumberOfTrips column
Tour_data['NumberOfTrips'] = Tour_data['NumberOfTrips'].fillna(Tour_data['NumberOfTrips'].median())
#Replace null values with median value for the NumberOfChildrenVisiting column
Tour_data['NumberOfChildrenVisiting'] = Tour_data['NumberOfChildrenVisiting'].fillna(Tour_data['NumberOfChildrenVisiting'].median())
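The column-by-column `fillna` calls above could equivalently be written as a loop over an imputation plan, which is easier to maintain as columns are added. A minimal sketch on a small hypothetical frame (column names chosen to mirror the dataset, values made up):

```python
import pandas as pd
import numpy as np

# Hypothetical sample with missing values in each column type
df = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0],
    "TypeofContact": ["Self Enquiry", None, "Company Invited"],
    "MonthlyIncome": [20993.0, 20130.0, np.nan],
})

# Median imputation for the numeric columns, a sentinel for the categorical one
median_cols = ["Age", "MonthlyIncome"]
for col in median_cols:
    df[col] = df[col].fillna(df[col].median())
df["TypeofContact"] = df["TypeofContact"].fillna("Unknown")

print(df.isnull().sum().sum())  # → 0
```

In the notebook the same loop would run over DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, and NumberOfChildrenVisiting as well.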
Tour_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4888 non-null   float64
 2   TypeofContact             4888 non-null   object
 3   CityTier                  4888 non-null   int64
 4   DurationOfPitch           4888 non-null   float64
 5   Occupation                4888 non-null   object
 6   Gender                    4888 non-null   object
 7   NumberOfPersonVisiting    4888 non-null   int64
 8   NumberOfFollowups         4888 non-null   float64
 9   ProductPitched            4888 non-null   object
 10  PreferredPropertyStar     4888 non-null   float64
 11  MaritalStatus             4888 non-null   object
 12  NumberOfTrips             4888 non-null   float64
 13  Passport                  4888 non-null   int64
 14  PitchSatisfactionScore    4888 non-null   int64
 15  OwnCar                    4888 non-null   int64
 16  NumberOfChildrenVisiting  4888 non-null   float64
 17  Designation               4888 non-null   object
 18  MonthlyIncome             4888 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 725.7+ KB
#Convert the data types of the variables
Tour_data['TypeofContact'] = Tour_data['TypeofContact'].astype('category')
Tour_data['Occupation'] = Tour_data['Occupation'].astype('category')
Tour_data['Gender'] = Tour_data['Gender'].astype('category')
Tour_data['ProductPitched'] = Tour_data['ProductPitched'].astype('category')
Tour_data['MaritalStatus'] = Tour_data['MaritalStatus'].astype('category')
Tour_data['Designation'] = Tour_data['Designation'].astype('category')
Tour_data['CityTier'] = Tour_data['CityTier'].astype('category')
Tour_data['NumberOfPersonVisiting'] = Tour_data['NumberOfPersonVisiting'].astype('category')
Tour_data['NumberOfFollowups'] = Tour_data['NumberOfFollowups'].astype('category')
Tour_data['Passport'] = Tour_data['Passport'].astype('category')
Tour_data['PitchSatisfactionScore'] = Tour_data['PitchSatisfactionScore'].astype('category')
Tour_data['PreferredPropertyStar'] = Tour_data['PreferredPropertyStar'].astype('category')
Tour_data['OwnCar'] = Tour_data['OwnCar'].astype('category')
Tour_data['NumberOfChildrenVisiting'] = Tour_data['NumberOfChildrenVisiting'].astype('category')
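The fourteen `astype('category')` assignments above could be collapsed into a bulk conversion: all object-dtype columns can be selected with `select_dtypes`, with the numeric code columns (CityTier, Passport, etc.) listed explicitly. A sketch on a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical frame mirroring a few of the dataset's columns
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Occupation": ["Salaried", "Small Business", "Salaried"],
    "CityTier": [1, 3, 1],
    "Age": [41.0, 49.0, 37.0],
})

# Convert every object column, plus explicitly listed numeric code columns
cat_cols = df.select_dtypes(include="object").columns.tolist() + ["CityTier"]
df[cat_cols] = df[cat_cols].astype("category")

print(df.dtypes.astype(str).tolist())  # → ['category', 'category', 'category', 'float64']
```

Continuous variables like Age and MonthlyIncome stay numeric, which is why they are excluded from the list.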
Tour_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4888 non-null   float64
 2   TypeofContact             4888 non-null   category
 3   CityTier                  4888 non-null   category
 4   DurationOfPitch           4888 non-null   float64
 5   Occupation                4888 non-null   category
 6   Gender                    4888 non-null   category
 7   NumberOfPersonVisiting    4888 non-null   category
 8   NumberOfFollowups         4888 non-null   category
 9   ProductPitched            4888 non-null   category
 10  PreferredPropertyStar     4888 non-null   category
 11  MaritalStatus             4888 non-null   category
 12  NumberOfTrips             4888 non-null   float64
 13  Passport                  4888 non-null   category
 14  PitchSatisfactionScore    4888 non-null   category
 15  OwnCar                    4888 non-null   category
 16  NumberOfChildrenVisiting  4888 non-null   category
 17  Designation               4888 non-null   category
 18  MonthlyIncome             4888 non-null   float64
dtypes: category(14), float64(4), int64(1)
memory usage: 260.0 KB
Tour_data['Gender'] = Tour_data['Gender'].replace(['Fe Male'],'Female')
Tour_data['Gender'].unique()
['Female', 'Male'] Categories (2, object): ['Female', 'Male']
Tour_data['MaritalStatus'].unique()
['Single', 'Divorced', 'Married', 'Unmarried'] Categories (4, object): ['Single', 'Divorced', 'Married', 'Unmarried']
Tour_data['MaritalStatus'] = Tour_data['MaritalStatus'].replace(['Single', 'Divorced'],'Unmarried')
Tour_data['MaritalStatus'].unique()
['Unmarried', 'Married'] Categories (2, object): ['Unmarried', 'Married']
#Group the Age into AgeRange buckets by adding a new column to the dataframe
#Note: pd.cut bins are left-open by default, so Age == 18 (the minimum) falls outside the first bucket
Tour_data['AgeRange'] = pd.cut(x = Tour_data['Age'],bins = [18,20,30,40,50,60,70])
#Group the MonthlyIncome into MonthlyIncomeRange buckets; likewise MonthlyIncome == 1000 falls outside the first bucket
Tour_data['MonthlyIncomeRange'] = pd.cut(x = Tour_data['MonthlyIncome'],bins = [1000,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000])
#Group the NumberOfTrips into NumberOfTripsRange buckets by adding a new column to the dataframe
Tour_data['NumberOfTripsRange'] = pd.cut(x = Tour_data['NumberOfTrips'],bins = [0,5,10,15,20,25])
#Group the DurationOfPitch into DurationOfPitchRange buckets by adding a new column to the dataframe
Tour_data['DurationOfPitchRange'] = pd.cut(x = Tour_data['DurationOfPitch'],bins = [0,5,10,15,20,25,30,35,40,45,50,55,60,70,80,90,100,110,120,130])
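A quick demonstration of that left-open bin behavior on hypothetical values; `include_lowest=True` is the usual fix when the minimum value should be kept in the first bucket:

```python
import pandas as pd

# pd.cut intervals are right-closed and left-open by default, so the left
# edge of the first bin is excluded: 18 falls into no bucket and becomes NaN
ages = pd.Series([18.0, 19.0, 25.0])
buckets = pd.cut(ages, bins=[18, 20, 30])
print(buckets.isna().tolist())  # → [True, False, False]

# include_lowest=True makes the first interval closed on both sides
buckets2 = pd.cut(ages, bins=[18, 20, 30], include_lowest=True)
print(buckets2.isna().tolist())  # → [False, False, False]
```

This explains why later crosstabs over AgeRange and MonthlyIncomeRange sum to slightly fewer than 4888 rows.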
print(Tour_data.head())
   ProdTaken   Age    TypeofContact CityTier  DurationOfPitch      Occupation  \
0          1  41.0     Self Enquiry        3              6.0        Salaried
1          0  49.0  Company Invited        1             14.0        Salaried
2          1  37.0     Self Enquiry        1              8.0     Free Lancer
3          0  33.0  Company Invited        1              9.0        Salaried
4          0  36.0     Self Enquiry        1              8.0  Small Business

   Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched  \
0  Female                      3               3.0         Deluxe
1    Male                      3               4.0         Deluxe
2    Male                      3               4.0          Basic
3  Female                      2               3.0          Basic
4    Male                      2               3.0          Basic

  PreferredPropertyStar MaritalStatus  NumberOfTrips Passport  \
0                   3.0     Unmarried            1.0        1
1                   4.0     Unmarried            2.0        0
2                   3.0     Unmarried            7.0        1
3                   3.0     Unmarried            2.0        1
4                   4.0     Unmarried            1.0        0

  PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation  \
0                      2      1                      0.0     Manager
1                      3      1                      2.0     Manager
2                      3      0                      0.0   Executive
3                      5      1                      1.0   Executive
4                      5      1                      0.0   Executive

   MonthlyIncome  AgeRange MonthlyIncomeRange NumberOfTripsRange  \
0        20993.0  (40, 50]     (20000, 30000]             (0, 5]
1        20130.0  (40, 50]     (20000, 30000]             (0, 5]
2        17090.0  (30, 40]     (10000, 20000]            (5, 10]
3        17909.0  (30, 40]     (10000, 20000]             (0, 5]
4        18468.0  (30, 40]     (10000, 20000]             (0, 5]

  DurationOfPitchRange
0              (5, 10]
1             (10, 15]
2              (5, 10]
3              (5, 10]
4              (5, 10]
# Plot histograms of all numerical variables
all_col = Tour_data.select_dtypes(include=np.number).columns.tolist()
all_col.remove('ProdTaken')  # exclude the target variable
plt.figure(figsize=(17, 10))
for i in range(len(all_col)):
    plt.subplot(2, 2, i + 1)  # 2x2 grid fits the four remaining numeric columns
    sns.histplot(Tour_data[all_col[i]], kde=True)  # kde=True overlays a distribution curve
    plt.tight_layout()
    plt.title(all_col[i], fontsize=25)
plt.show()
Age: The average age of the customers is 37 years, with a range from 18 to 61. Most customers are concentrated between 30 and 40 years, and the distribution is close to normal.
DurationOfPitch: Most pitches last between 5 and 35 minutes, with an average of around 15 minutes. The data is right-skewed with a long right tail, indicating many outliers on the right end.
NumberOfTrips: The average number of trips per customer is 3. The data is heavily right-skewed, indicating outliers in the right tail.
MonthlyIncome: Most incomes are concentrated between 20k and 40k, with an average of around 23k USD. The data is heavily right-skewed, and there are outliers at both ends: below 20k (minimum 1k) and above 40k (maximum around 98.7k).
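The skewness claims above can be checked numerically with pandas' `skew()` (sample skewness; positive values indicate a long right tail). A quick sketch on hypothetical income values, not the actual dataset:

```python
import pandas as pd

# Hypothetical incomes: a single large value creates a long right tail
incomes = pd.Series([17000, 18000, 20000, 21000, 22000, 25000, 98000])

# Positive skewness confirms the right (upper) tail dominates
print(incomes.skew() > 1)  # → True
```

In the notebook, `Tour_data[['Age', 'DurationOfPitch', 'NumberOfTrips', 'MonthlyIncome']].skew()` would quantify the skew per column alongside the histograms.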
# For univariate analysis of numerical variables we want to study their
# central tendency and dispersion, so let us write a function that creates
# a combined boxplot and histogram for any input numerical column.
def histogram_boxplot(feature, figsize=(10, 5), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (10, 5))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,                                      # 2 rows: boxplot on top, histogram below
        sharex=True,                                  # x-axis shared between the subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='yellow')  # a marker indicates the mean
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins, color='cyan')
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, color='tab:red')
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')   # mean line on the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # median line on the histogram
histogram_boxplot(Tour_data.Age)
histogram_boxplot(Tour_data.DurationOfPitch)
histogram_boxplot(Tour_data.NumberOfTrips)
histogram_boxplot(Tour_data.MonthlyIncome)
# Function to create barplots that indicate the percentage for each category.
def perc_on_bar(z):
    '''
    plot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    '''
    total = len(Tour_data[z])  # length of the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(Tour_data[z], palette='Paired')
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position: centre of the bar
        y = p.get_y() + p.get_height()            # y position: top of the bar
        ax.annotate(percentage, (x, y), size=10)  # annotate the percentage
    plt.xticks(rotation=90)
    plt.show()  # show the plot
#Lets plot the percentage countplot for all categorical variables
perc_on_bar('ProdTaken')
perc_on_bar('TypeofContact')
perc_on_bar('Occupation')
perc_on_bar('Gender')
perc_on_bar('ProductPitched')
perc_on_bar('MaritalStatus')
perc_on_bar('Designation')
perc_on_bar('CityTier')
perc_on_bar('NumberOfPersonVisiting')
perc_on_bar('NumberOfFollowups')
perc_on_bar('Passport')
perc_on_bar('PitchSatisfactionScore')
perc_on_bar('PreferredPropertyStar')
perc_on_bar('OwnCar')
perc_on_bar('NumberOfChildrenVisiting')
#Plot the heatmap to check the correlation between numeric variables
numeric_columns = Tour_data.select_dtypes(include = np.number).columns.tolist()
corr = (
    Tour_data[numeric_columns].corr().sort_values(by=['ProdTaken'], ascending=False)
)  # sorting correlations w.r.t. ProdTaken
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(28, 15))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
corr,
cmap="seismic",
annot=True,
fmt=".1f",
vmin=-1,
vmax=1,
center=0,
square=False,
linewidths=0.7,
cbar_kws={"shrink": 0.5},
)
<AxesSubplot:>
# Pairplot for all the variables
sns.pairplot(Tour_data, hue = 'ProdTaken')
<seaborn.axisgrid.PairGrid at 0x1bc04a7a220>
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='muted')
    # crosstab of the feature against the target, with row/column totals
    tab1 = pd.crosstab(x, Tour_data['ProdTaken'], margins=True)
    print(tab1)
    print('-' * 100)
    # visualising the crosstab, normalized per row
    tab = pd.crosstab(x, Tour_data['ProdTaken'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(17, 7))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
for col in ['AgeRange', 'MonthlyIncomeRange', 'TypeofContact', 'Occupation', 'Gender',
            'NumberOfTripsRange', 'DurationOfPitchRange', 'ProductPitched',
            'MaritalStatus', 'Designation', 'CityTier', 'NumberOfPersonVisiting',
            'NumberOfFollowups', 'Passport', 'PitchSatisfactionScore',
            'PreferredPropertyStar']:
    stacked_plot(Tour_data[col])
ProdTaken     0    1   All
AgeRange
(18, 20]     24   46    70
(20, 30]    744  287  1031
(30, 40]   1805  346  2151
(40, 50]    929  144  1073
(50, 60]    451   89   540
(60, 70]      9    0     9
All        3962  912  4874
----------------------------------------------------------------------------------------------------
ProdTaken              0    1   All
MonthlyIncomeRange
(1000, 10000]          1    0     1
(10000, 20000]       754  284  1038
(20000, 30000]      2682  576  3258
(30000, 40000]       528   60   588
(90000, 100000]        2    0     2
All                 3967  920  4887
----------------------------------------------------------------------------------------------------
ProdTaken           0    1   All
TypeofContact
Company Invited  1109  310  1419
Self Enquiry     2837  607  3444
Unknown            22    3    25
All              3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken          0    1   All
Occupation
Free Lancer        0    2     2
Large Business   314  120   434
Salaried        1954  414  2368
Small Business  1700  384  2084
All             3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken     0    1   All
Gender
Female     1630  342  1972
Male       2338  578  2916
All        3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken             0    1   All
NumberOfTripsRange
(0, 5]             3476  763  4239
(5, 10]             490  155   645
(15, 20]              0    2     2
(20, 25]              2    0     2
All                3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken                0    1   All
DurationOfPitchRange
(0, 5]                   6    0     6
(5, 10]               1438  271  1709
(10, 15]              1156  240  1396
(15, 20]               504  139   643
(20, 25]               294   90   384
(25, 30]               280   94   374
(30, 35]               254   76   330
(35, 40]                34   10    44
(120, 130]               2    0     2
All                   3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken         0    1   All
ProductPitched
Basic          1290  552  1842
Deluxe         1528  204  1732
King            210   20   230
Standard        618  124   742
Super Deluxe    322   20   342
All            3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken        0    1   All
MaritalStatus
Married       2014  326  2340
Unmarried     1954  594  2548
All           3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken          0    1   All
Designation
AVP              322   20   342
Executive       1290  552  1842
Manager         1528  204  1732
Senior Manager   618  124   742
VP               210   20   230
All             3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken     0    1   All
CityTier
1          2670  520  3190
2           152   46   198
3          1146  354  1500
All        3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken                  0    1   All
NumberOfPersonVisiting
1                         39    0    39
2                       1151  267  1418
3                       1942  460  2402
4                        833  193  1026
5                          3    0     3
All                     3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken             0    1   All
NumberOfFollowups
1.0                 156   20   176
2.0                 205   24   229
3.0                1222  244  1466
4.0                1726  387  2113
5.0                 577  191   768
6.0                  82   54   136
All                3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken     0    1   All
Passport
0          3040  426  3466
1           928  494  1422
All        3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken                  0    1   All
PitchSatisfactionScore
1                        798  144   942
2                        498   88   586
3                       1162  316  1478
4                        750  162   912
5                        760  210   970
All                     3968  920  4888
----------------------------------------------------------------------------------------------------
ProdTaken                 0    1   All
PreferredPropertyStar
3.0                    2531  488  3019
4.0                     731  182   913
5.0                     706  250   956
All                    3968  920  4888
----------------------------------------------------------------------------------------------------
sns.countplot(x = 'NumberOfChildrenVisiting', hue = 'ProductPitched', data = Tour_data)
sns.countplot(x = 'PreferredPropertyStar', hue = 'ProductPitched', data = Tour_data)
sns.countplot(x = 'MonthlyIncomeRange', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'AgeRange', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'TypeofContact', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'CityTier', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'Occupation', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'Gender', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'MaritalStatus', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'NumberOfTrips', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'Passport', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'OwnCar', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'PitchSatisfactionScore', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.legend(bbox_to_anchor = (1,1))
plt.show()
sns.countplot(x = 'NumberOfFollowups', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.show()
sns.countplot(x = 'DurationOfPitchRange', hue = 'ProductPitched', data = Tour_data)
plt.xticks(rotation = 90)
plt.legend(bbox_to_anchor = (1,1))
plt.show()
numerical_col = Tour_data.select_dtypes(include=np.number).columns.tolist()
numerical_col.remove('ProdTaken')
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
plt.subplot(5,4,i+1)
plt.boxplot(Tour_data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
There are outliers in several numerical variables. Let's treat them using the capping (IQR) method and check the boxplots again.
def treat_outliers(data, col):
    '''
    Treats outliers in a numerical variable by capping.
    data: data frame
    col: str, name of the numerical variable
    '''
    Q1 = data[col].quantile(0.25)  # 25th percentile
    Q3 = data[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values below Lower_Whisker are raised to it, and values above Upper_Whisker are lowered to it
    data[col] = np.clip(data[col], Lower_Whisker, Upper_Whisker)
    return data
def treat_outliers_all(data, col_list):
    '''
    Treats outliers in all numerical variables.
    data: data frame
    col_list: list of numerical variables
    '''
    for c in col_list:
        data = treat_outliers(data, c)
    return data
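As a quick, stdlib-only sketch (with made-up numbers) of what this IQR capping does — note that `statistics.quantiles` uses a slightly different interpolation than pandas' `quantile`, so whisker values can differ marginally:

```python
# Illustration of IQR capping on toy data (hypothetical values):
# anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is pulled back to the whisker.
from statistics import quantiles

values = [4, 5, 5, 6, 6, 7, 7, 8, 40]      # 40 is an obvious outlier
q1, _, q3 = quantiles(values, n=4)          # 25th and 75th percentiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = [min(max(v, lower), upper) for v in values]
print(capped)                               # 40 is capped to the upper whisker (11.25)
```

The key property is that capping changes only the extreme values; the bulk of the distribution is untouched.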
numerical_col = Tour_data.select_dtypes(include=np.number).columns.tolist()# getting list of numerical columns
numerical_col.remove('ProdTaken')
Tour_data = treat_outliers_all(Tour_data,numerical_col)
numerical_col = Tour_data.select_dtypes(include=np.number).columns.tolist()
numerical_col.remove('ProdTaken')
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
plt.subplot(5,4,i+1)
plt.boxplot(Tour_data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Tour_data1 = Tour_data.copy()
Tour_data1.drop(['AgeRange','DurationOfPitchRange','NumberOfTripsRange','MonthlyIncomeRange'],axis=1,inplace=True)
Tour_data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4888 non-null   float64
 2   TypeofContact             4888 non-null   category
 3   CityTier                  4888 non-null   category
 4   DurationOfPitch           4888 non-null   float64
 5   Occupation                4888 non-null   category
 6   Gender                    4888 non-null   category
 7   NumberOfPersonVisiting    4888 non-null   category
 8   NumberOfFollowups         4888 non-null   category
 9   ProductPitched            4888 non-null   category
 10  PreferredPropertyStar     4888 non-null   category
 11  MaritalStatus             4888 non-null   category
 12  NumberOfTrips             4888 non-null   float64
 13  Passport                  4888 non-null   category
 14  PitchSatisfactionScore    4888 non-null   category
 15  OwnCar                    4888 non-null   category
 16  NumberOfChildrenVisiting  4888 non-null   category
 17  Designation               4888 non-null   category
 18  MonthlyIncome             4888 non-null   float64
dtypes: category(14), float64(4), int64(1)
memory usage: 259.9 KB
X = Tour_data1.drop(['ProdTaken'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = Tour_data1['ProdTaken']
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
(3421, 41) (1467, 41)
y.value_counts(1)
0    0.811784
1    0.188216
Name: ProdTaken, dtype: float64
y_test.value_counts(1)
0    0.811861
1    0.188139
Name: ProdTaken, dtype: float64
Let's define a function that reports metric scores (accuracy, recall, precision, and F1) on the train and test sets, and a function that shows the confusion matrix, so that we do not have to repeat the same code while evaluating models.
# Metrics, model classes, and grid search used in the sections below
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)
from sklearn.model_selection import GridSearchCV

## Function to calculate different metric scores of the model - Accuracy, Recall, Precision and F1 Score
def get_metrics_score(model, flag=True):
    '''
    model: fitted classifier used to predict on X_train and X_test
    flag: if True (default), also print the scores
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    train_f1_score = metrics.f1_score(y_train, pred_train)
    test_f1_score = metrics.f1_score(y_test, pred_test)
    # list storing train and test results
    score_list = [train_acc, test_acc, train_recall, test_recall,
                  train_precision, test_precision, train_f1_score, test_f1_score]
    # The print statements are displayed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 score on training set : ", train_f1_score)
        print("F1 score on test set : ", test_f1_score)
    return score_list  # returning the list with train and test scores
def confusion_matrix_sklearn(model, predictors, target):
    """
    Plot the confusion matrix with counts and percentages.
    model: fitted classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
We will build our model using DecisionTreeClassifier, with the default 'gini' criterion for splitting.
If class A makes up 20% of the data and class B 80%, then class B dominates and the decision tree becomes biased toward the dominant class.
To counter this, we can pass a dictionary such as {0: 0.19, 1: 0.81} to the model to specify the weight of each class, so the tree gives more weight to the minority class 1.
class_weight is a hyperparameter of the decision tree classifier.
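The weights used below ({0: 0.188, 1: 0.811}) are simply the two class frequencies swapped. A related, common recipe is sklearn's class_weight='balanced' formula; here is a stdlib-only sketch of it on hypothetical counts mirroring this data's roughly 81/19 split:

```python
from collections import Counter

y_toy = [0] * 81 + [1] * 19                 # hypothetical labels, ~81/19 split
counts = Counter(y_toy)
n, k = len(y_toy), len(counts)
# sklearn's 'balanced' formula: n_samples / (n_classes * count_of_class)
balanced = {c: round(n / (k * cnt), 3) for c, cnt in counts.items()}
print(balanced)                             # minority class 1 gets the larger weight
```

Either way, the effect is the same: misclassifying a minority-class sample costs more, pushing the tree to pay attention to class 1.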
dtree = DecisionTreeClassifier(criterion='gini',class_weight={0:0.188,1:0.811},random_state=1)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.188, 1: 0.811}, random_state=1)
confusion_matrix_sklearn(dtree, X_test, y_test)
Customer buys and the model correctly predicted that the customer will buy the travel package: True Positive (observed=1, predicted=1)
Customer didn't buy but the model predicted the customer will buy the travel package: False Positive (observed=0, predicted=1)
Customer didn't buy and the model predicted the customer will not buy the travel package: True Negative (observed=0, predicted=0)
Customer buys but the model predicted the customer will not buy the travel package: False Negative (observed=1, predicted=0)
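Those four quadrants can be counted directly from labels; a toy sketch with made-up values:

```python
observed  = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical true labels
predicted = [1, 1, 0, 0, 1, 0, 1, 0]   # hypothetical model predictions

tp = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))  # true positives
fp = sum(o == 0 and p == 1 for o, p in zip(observed, predicted))  # false positives
tn = sum(o == 0 and p == 0 for o, p in zip(observed, predicted))  # true negatives
fn = sum(o == 1 and p == 0 for o, p in zip(observed, predicted))  # false negatives
print(tp, fp, tn, fn)                  # 3 1 3 1
# recall = tp / (tp + fn), precision = tp / (tp + fp)
```

For this business problem, false negatives (buyers the model misses) are the costly cell, which is why recall is the headline metric below.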
print("Decision Tree Model performance \n")
dtree_model_perf=get_metrics_score(dtree)
Decision Tree Model performance

Accuracy on training set :  1.0
Accuracy on test set :  0.8766189502385822
Recall on training set :  1.0
Recall on test set :  0.6811594202898551
Precision on training set :  1.0
Precision on test set :  0.6690391459074733
F1 score on training set :  1.0
F1 score on test set :  0.6750448833034111
importances = dtree.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
confusion_matrix_sklearn(bagging, X_test, y_test)
print("Bagging Model performance \n")
bagging_model_perf=get_metrics_score(bagging)
Bagging Model performance

Accuracy on training set :  0.9938614440222158
Accuracy on test set :  0.8970688479890934
Recall on training set :  0.9704968944099379
Recall on test set :  0.5760869565217391
Precision on training set :  0.9968102073365231
Precision on test set :  0.8238341968911918
F1 score on training set :  0.9834775767112511
F1 score on test set :  0.678038379530917
bagging_wt = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini',class_weight={0:0.188,1:0.811},random_state=1),random_state=1)
bagging_wt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.188,
1: 0.811},
random_state=1),
random_state=1)
confusion_matrix_sklearn(bagging_wt,X_test,y_test)
print("Bagging weighted Model performance \n")
bagging_wt_model_perf=get_metrics_score(bagging_wt)
Bagging weighted Model performance

Accuracy on training set :  0.9944460684010523
Accuracy on test set :  0.9004771642808452
Recall on training set :  0.9720496894409938
Recall on test set :  0.5434782608695652
Precision on training set :  0.9984051036682615
Precision on test set :  0.8823529411764706
F1 score on training set :  0.985051140833989
F1 score on test set :  0.672645739910314
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
confusion_matrix_sklearn(rf,X_test,y_test)
print("Random Forest Model performance \n")
rf_model_perf=get_metrics_score(rf)
Random Forest Model performance

Accuracy on training set :  1.0
Accuracy on test set :  0.8997955010224948
Recall on training set :  1.0
Recall on test set :  0.5
Precision on training set :  1.0
Precision on test set :  0.9387755102040817
F1 score on training set :  1.0
F1 score on test set :  0.652482269503546
importances = rf.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
rf_wt = RandomForestClassifier(class_weight={0:0.188,1:0.811}, random_state=1)
rf_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.188, 1: 0.811}, random_state=1)
confusion_matrix_sklearn(rf_wt, X_test,y_test)
print("Random Forest weighted Model performance \n")
rf_wt_model_perf=get_metrics_score(rf_wt)
Random Forest weighted Model performance

Accuracy on training set :  1.0
Accuracy on test set :  0.8936605316973415
Recall on training set :  1.0
Recall on test set :  0.4673913043478261
Precision on training set :  1.0
Precision on test set :  0.9347826086956522
F1 score on training set :  1.0
F1 score on test set :  0.6231884057971014
importances = rf_wt.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the model's loss, so we usually resort to experimentation — i.e., we'll use grid search.
Grid search is a tuning technique that attempts to find the optimal values of hyperparameters.
It is an exhaustive search performed over the specified parameter values of a model.
The parameters of the estimator are optimized by cross-validated grid search over a parameter grid.
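To make "exhaustive" concrete: the decision-tree grid defined just below has 28 × 5 × 5 × 4 parameter combinations, each refit once per CV fold. A quick count of the candidates:

```python
from itertools import product

grid = {
    'max_depth': list(range(2, 30)),                     # 28 values
    'min_samples_leaf': [1, 2, 5, 7, 10],
    'max_leaf_nodes': [2, 3, 5, 10, 15],
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1],
}
n_candidates = len(list(product(*grid.values())))
print(n_candidates)    # 2800 candidate models (x5 with 5-fold CV = 14000 fits)
```

This is why grids are kept small: candidate counts multiply across parameters, and every candidate costs a full model fit per fold.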
#Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.188,1:0.811},random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.188, 1: 0.811}, max_depth=6,
max_leaf_nodes=15, min_impurity_decrease=0.0001,
random_state=1)
confusion_matrix_sklearn(dtree_estimator, X_test,y_test)
print("Decision Tree Estimator Model Performance \n")
dtree_estimator_model_perf=get_metrics_score(dtree_estimator)
Decision Tree Estimator Model Performance

Accuracy on training set :  0.7679041216018708
Accuracy on test set :  0.7743694614860259
Recall on training set :  0.6894409937888198
Recall on test set :  0.6920289855072463
Precision on training set :  0.4277456647398844
Precision on test set :  0.43707093821510296
F1 score on training set :  0.5279429250891795
F1 score on test set :  0.5357643758765778
importances = dtree_estimator.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# grid search for bagging classifier
cl1 = DecisionTreeClassifier(class_weight={0:0.188,1:0.811},random_state=1)
param_grid = {'base_estimator':[cl1],
'n_estimators':[5,7,15,51,101],
'max_features': [0.7,0.8,0.9,1],
'max_samples':[0.65]
}
grid = GridSearchCV(BaggingClassifier(random_state=1,bootstrap=True), param_grid=param_grid, scoring = 'recall', cv = 5)
grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1),
param_grid={'base_estimator': [DecisionTreeClassifier(class_weight={0: 0.188,
1: 0.811},
random_state=1)],
'max_features': [0.7, 0.8, 0.9, 1],
'max_samples': [0.65],
'n_estimators': [5, 7, 15, 51, 101]},
scoring='recall')
# getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.188,
1: 0.811},
random_state=1),
max_features=1, max_samples=0.65, n_estimators=51,
random_state=1)
confusion_matrix_sklearn(bagging_estimator, X_test,y_test)
print("Bagging Estimator Model Performance \n")
bagging_estimator_model_perf=get_metrics_score(bagging_estimator)
Bagging Estimator Model Performance

Accuracy on training set :  0.6235019000292312
Accuracy on test set :  0.6223585548738922
Recall on training set :  0.7872670807453416
Recall on test set :  0.8079710144927537
Precision on training set :  0.3057901085645356
Precision on test set :  0.3080110497237569
F1 score on training set :  0.4404865334491747
F1 score on test set :  0.44600000000000006
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110,251,501],
"min_samples_leaf": np.arange(1, 6,1),
"max_features": [0.7,0.9,'log2','auto'],
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features=0.9, n_estimators=501, random_state=1)
confusion_matrix_sklearn(rf_estimator, X_test,y_test)
print("Random Forest Estimator Model Performance \n")
rf_estimator_model_perf=get_metrics_score(rf_estimator)
Random Forest Estimator Model Performance

Accuracy on training set :  1.0
Accuracy on test set :  0.9209270620313565
Recall on training set :  1.0
Recall on test set :  0.6630434782608695
Precision on training set :  1.0
Precision on test set :  0.8883495145631068
F1 score on training set :  1.0
F1 score on test set :  0.7593360995850623
# Importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as Gini importance)
print(pd.DataFrame(rf.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by='Imp', ascending=False))
                                   Imp
MonthlyIncome                 0.125966
Age                           0.123320
DurationOfPitch               0.098190
Passport_1                    0.069332
NumberOfTrips                 0.063453
CityTier_3                    0.030489
Gender_Male                   0.026177
Designation_Executive         0.026081
PreferredPropertyStar_5.0     0.026005
TypeofContact_Self Enquiry    0.024615
MaritalStatus_Unmarried       0.024281
OwnCar_1                      0.022226
PitchSatisfactionScore_3      0.021327
PitchSatisfactionScore_4      0.019878
Occupation_Salaried           0.019508
Occupation_Small Business     0.019239
PreferredPropertyStar_4.0     0.017872
PitchSatisfactionScore_5      0.017596
NumberOfFollowups_4.0         0.017094
NumberOfFollowups_5.0         0.016532
NumberOfFollowups_3.0         0.016358
NumberOfChildrenVisiting_1.0  0.015811
Occupation_Large Business     0.015653
NumberOfPersonVisiting_3      0.015535
NumberOfPersonVisiting_2      0.015147
NumberOfChildrenVisiting_2.0  0.012535
NumberOfPersonVisiting_4      0.012497
NumberOfFollowups_6.0         0.010442
Designation_Manager           0.010272
ProductPitched_Deluxe         0.009898
CityTier_2                    0.009111
PitchSatisfactionScore_2      0.008957
NumberOfChildrenVisiting_3.0  0.007443
Designation_Senior Manager    0.007354
ProductPitched_Standard       0.006832
NumberOfFollowups_2.0         0.005644
ProductPitched_Super Deluxe   0.005368
Designation_VP                0.002481
ProductPitched_King           0.002417
TypeofContact_Unknown         0.001035
NumberOfPersonVisiting_5      0.000030
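The "criterion" whose reduction is being measured is, by default, Gini impurity. As a small sanity-check sketch, the impurity of the root node given this data's ProdTaken split (3968 vs 920, from the crosstabs above) works out to about 0.306:

```python
# Gini impurity of a node with class counts `counts`: 1 - sum(p_i^2)
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([3968, 920]), 3))   # ~0.306 at the root
```

A pure node (all one class) has impurity 0 and a 50/50 node has 0.5; each split's importance contribution is how much it lowers this quantity, weighted by node size.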
feature_names = X_train.columns
importances = rf_estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
!pip install xgboost
Requirement already satisfied: xgboost in c:\users\vivek\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: scipy in c:\users\vivek\anaconda3\lib\site-packages (from xgboost) (1.6.1)
Requirement already satisfied: numpy in c:\users\vivek\anaconda3\lib\site-packages (from xgboost) (1.19.2)
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
confusion_matrix_sklearn(abc, X_test,y_test)
print("AdaBoost Model Performance \n")
Adaboost_model_perf=get_metrics_score(abc)
AdaBoost Model Performance

Accuracy on training set :  0.84536685179772
Accuracy on test set :  0.8486707566462167
Recall on training set :  0.3027950310559006
Recall on test set :  0.29347826086956524
Precision on training set :  0.7090909090909091
Precision on test set :  0.75
F1 score on training set :  0.4243743199129489
F1 score on test set :  0.42187500000000006
importances = abc.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
confusion_matrix_sklearn(gbc, X_test,y_test)
print("GradientBoost Model Performance \n")
Gradientboost_model_perf=get_metrics_score(gbc)
GradientBoost Model Performance

Accuracy on training set :  0.8842443729903537
Accuracy on test set :  0.8752556237218814
Recall on training set :  0.43788819875776397
Recall on test set :  0.4166666666666667
Precision on training set :  0.8924050632911392
Precision on test set :  0.8394160583941606
F1 score on training set :  0.5875
F1 score on test set :  0.5569007263922517
importances = gbc.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from xgboost import XGBClassifier  # xgboost was installed above

xgb = XGBClassifier(random_state=1)
xgb.fit(X_train,y_train)
[17:37:48] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
confusion_matrix_sklearn(xgb, X_test,y_test)
print("XGBoost Model Performance \n")
XGboost_model_perf=get_metrics_score(xgb)
XGBoost Model Performance

Accuracy on training set :  0.9982461268634902
Accuracy on test set :  0.923653715064758
Recall on training set :  0.9906832298136646
Recall on test set :  0.6811594202898551
Precision on training set :  1.0
Precision on test set :  0.8867924528301887
F1 score on training set :  0.9953198127925117
F1 score on test set :  0.7704918032786885
importances = xgb.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.3000000000000003, n_estimators=100,
random_state=1)
confusion_matrix_sklearn(abc_tuned, X_test,y_test)
print("AdaBoost Tuned Model Performance \n")
Adaboost_tuned_perf=get_metrics_score(abc_tuned)
AdaBoost Tuned Model Performance

Accuracy on training set :  0.9856767027185034
Accuracy on test set :  0.8895705521472392
Recall on training set :  0.937888198757764
Recall on test set :  0.677536231884058
Precision on training set :  0.9853181076672104
Precision on test set :  0.7192307692307692
F1 score on training set :  0.9610182975338106
F1 score on test set :  0.6977611940298507
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
gbc_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
gbc_init.fit(X_train,y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
random_state=1)
confusion_matrix_sklearn(gbc_init, X_test,y_test)
print("GradientBoost Init Model Performance \n")
Gradientboost_init_perf=get_metrics_score(gbc_init)
GradientBoost Init Model Performance

Accuracy on training set :  0.8833674364220988
Accuracy on test set :  0.869120654396728
Recall on training set :  0.42857142857142855
Recall on test set :  0.39492753623188404
Precision on training set :  0.8990228013029316
Precision on test set :  0.8134328358208955
F1 score on training set :  0.5804416403785488
F1 score on test set :  0.5317073170731706
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init='zero',random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init='zero', max_features=0.7, n_estimators=250,
random_state=1, subsample=0.8)
confusion_matrix_sklearn(gbc_tuned, X_test,y_test)
print("GradientBoost Tuned Model performance \n")
Gradientboost_tuned_perf=get_metrics_score(gbc_tuned)
GradientBoost Tuned Model performance

Accuracy on training set :  0.9155217772581117
Accuracy on test set :  0.880027266530334
Recall on training set :  0.5869565217391305
Recall on test set :  0.47101449275362317
Precision on training set :  0.942643391521197
Precision on test set :  0.8125
F1 score on training set :  0.7234449760765551
F1 score on test set :  0.5963302752293579
# Choose the type of classifier.
gbc_tuned1 = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned1, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned1 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned1.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=250, random_state=1,
subsample=0.9)
confusion_matrix_sklearn(gbc_tuned1, X_test,y_test)
print("GradientBoost Tuned1 Model Performance \n")
Gradientboost_tuned1_perf=get_metrics_score(gbc_tuned1)
GradientBoost Tuned1 Model Performance

Accuracy on training set : 0.9152294650686934
Accuracy on test set : 0.8868438991138378
Recall on training set : 0.59472049689441
Recall on test set : 0.4963768115942029
Precision on training set : 0.9296116504854369
Precision on test set : 0.8353658536585366
F1 score on training set : 0.725378787878788
F1 score on test set : 0.6227272727272728
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
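The importance-plotting code above is repeated verbatim for the XGBoost model below; it could be factored into a small helper. A minimal sketch (the function name and the returned ordering are my own additions; the `Agg` backend line is only needed outside a notebook):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

def plot_feature_importances(model, feature_names, color="violet", figsize=(12, 12)):
    """Horizontal bar chart of a fitted model's feature_importances_,
    sorted ascending; returns the feature names in plotted order."""
    importances = np.asarray(model.feature_importances_)
    indices = np.argsort(importances)
    plt.figure(figsize=figsize)
    plt.title("Feature Importances")
    plt.barh(range(len(indices)), importances[indices], color=color, align="center")
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel("Relative Importance")
    plt.show()
    return [feature_names[i] for i in indices]
```

Usage would then be `plot_feature_importances(gbc_tuned, list(X.columns))` for each fitted model.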
XGBoost has many hyperparameters that can be tuned to improve model performance.
Some of the important ones, all searched over in the grid below, are n_estimators, learning_rate, subsample, scale_pos_weight, gamma, colsample_bytree, and colsample_bylevel.
# Choose the type of classifier.
xgb_tuned = XGBClassifier(eval_metric=['error'],random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,200],
"subsample":[0.9,1],
"scale_pos_weight":[1,5],
"learning_rate":[0.01,0.3],
"gamma":[1,3],
"colsample_bytree":[0.5,0.7],
"colsample_bylevel":[0.5,0.7]
}
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.5,
colsample_bynode=1, colsample_bytree=0.5, eval_metric=['error'],
gamma=1, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=5, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
confusion_matrix_sklearn(xgb_tuned, X_test,y_test)
XGboost_tuned_perf=get_metrics_score(xgb_tuned)
print("Training performance \n",XGboost_tuned_perf)
Accuracy on training set : 0.8322128032738966
Accuracy on test set : 0.809134287661895
Recall on training set : 0.8959627329192547
Recall on test set : 0.8405797101449275
Precision on training set : 0.5322878228782287
Precision on test set : 0.49572649572649574
F1 score on training set : 0.667824074074074
F1 score on test set : 0.6236559139784946
Training performance
 [0.8322128032738966, 0.809134287661895, 0.8959627329192547, 0.8405797101449275, 0.5322878228782287, 0.49572649572649574, 0.667824074074074, 0.6236559139784946]
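The grid selected `scale_pos_weight=5`, which upweights the minority (purchasing) class and explains the jump in recall at the cost of precision. A common starting heuristic sets it near the ratio of negative to positive samples; a sketch with a hypothetical label array mirroring the ~18% purchase rate from the data description:

```python
import numpy as np

# Hypothetical labels: ~18% positives, as in the historical purchase data
y_toy = np.array([0] * 82 + [1] * 18)
neg, pos = np.bincount(y_toy)
scale_pos_weight = neg / pos  # ~4.6, close to the grid's chosen value of 5
```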
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
estimators=[('Decision Tree', dtree_estimator),('Random Forest', rf_estimator),
('Gradient Boosting', gbc_tuned)]
final_estimator=xgb
stacking_estimator=StackingClassifier(estimators=estimators, final_estimator=final_estimator,cv=5)
stacking_estimator.fit(X_train,y_train)
[19:15:41] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
StackingClassifier(cv=5,
estimators=[('Decision Tree',
DecisionTreeClassifier(class_weight={0: 0.188,
1: 0.811},
max_depth=6,
max_leaf_nodes=15,
min_impurity_decrease=0.0001,
random_state=1)),
('Random Forest',
RandomForestClassifier(max_features=0.9,
n_estimators=501,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init='zero',
max_features=0.7,
n_...
colsample_bytree=1, gamma=0,
gpu_id=-1,
importance_type='gain',
interaction_constraints='',
learning_rate=0.300000012,
max_delta_step=0, max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=100, n_jobs=4,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
confusion_matrix_sklearn(stacking_estimator, X_test,y_test)
stacking_estimator_model_perf = get_metrics_score(stacking_estimator)
Accuracy on training set : 0.9959076293481438
Accuracy on test set : 0.9100204498977505
Recall on training set : 0.9984472049689441
Recall on test set : 0.7028985507246377
Precision on training set : 0.9801829268292683
Precision on test set : 0.7950819672131147
F1 score on training set : 0.9892307692307692
F1 score on test set : 0.7461538461538463
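`StackingClassifier` fits the final estimator on cross-validated predictions of the base estimators rather than on the raw features, which is why its train scores run so high here. A self-contained toy sketch of the same pattern (base learners and the logistic-regression meta-learner are chosen purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base predictions
    cv=5,  # out-of-fold predictions are used to train the meta-learner
)
stack.fit(X_toy, y_toy)
```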
# defining list of models
models = [dtree,bagging,bagging_wt,rf,rf_wt,dtree_estimator,bagging_estimator,rf_estimator, abc, abc_tuned, gbc, gbc_init, gbc_tuned, xgb, xgb_tuned, stacking_estimator]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_score_train.append(np.round(j[6],2))
f1_score_test.append(np.round(j[7],2))
comparison_frame = pd.DataFrame({'Model':['Decision Tree','Bagging','Bagging with class weights','Random Forest','Random Forest with weights','Decision Tree Tuned','Bagging Tuned','Random Forest Tuned','AdaBoost with default parameters','AdaBoost Tuned',
'Gradient Boosting with default parameters','Gradient Boosting with init=Adaboost',
'Gradient Boosting Tuned','XGBoost with default parameters', 'XGBoost Tuned', 'Stacking Estimator'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_f1score':f1_score_train,'Test_f1score':f1_score_test})
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_f1score | Test_f1score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.00 | 0.88 | 1.00 | 0.68 | 1.00 | 0.67 | 1.00 | 0.68 |
| 1 | Bagging | 0.99 | 0.90 | 0.97 | 0.58 | 1.00 | 0.82 | 0.98 | 0.68 |
| 2 | Bagging with class weights | 0.99 | 0.90 | 0.97 | 0.54 | 1.00 | 0.88 | 0.99 | 0.67 |
| 3 | Random Forest | 1.00 | 0.90 | 1.00 | 0.50 | 1.00 | 0.94 | 1.00 | 0.65 |
| 4 | Random Forest with weights | 1.00 | 0.89 | 1.00 | 0.47 | 1.00 | 0.93 | 1.00 | 0.62 |
| 5 | Decision Tree Tuned | 0.77 | 0.77 | 0.69 | 0.69 | 0.43 | 0.44 | 0.53 | 0.54 |
| 6 | Bagging Tuned | 0.62 | 0.62 | 0.79 | 0.81 | 0.31 | 0.31 | 0.44 | 0.45 |
| 7 | Random Forest Tuned | 1.00 | 0.92 | 1.00 | 0.66 | 1.00 | 0.89 | 1.00 | 0.76 |
| 8 | AdaBoost with default parameters | 0.85 | 0.85 | 0.30 | 0.29 | 0.71 | 0.75 | 0.42 | 0.42 |
| 9 | AdaBoost Tuned | 0.99 | 0.89 | 0.94 | 0.68 | 0.99 | 0.72 | 0.96 | 0.70 |
| 10 | Gradient Boosting with default parameters | 0.88 | 0.88 | 0.44 | 0.42 | 0.89 | 0.84 | 0.59 | 0.56 |
| 11 | Gradient Boosting with init=Adaboost | 0.88 | 0.87 | 0.43 | 0.39 | 0.90 | 0.81 | 0.58 | 0.53 |
| 12 | Gradient Boosting Tuned | 0.92 | 0.88 | 0.59 | 0.47 | 0.94 | 0.81 | 0.72 | 0.60 |
| 13 | XGBoost with default parameters | 1.00 | 0.92 | 0.99 | 0.68 | 1.00 | 0.89 | 1.00 | 0.77 |
| 14 | XGBoost Tuned | 0.83 | 0.81 | 0.90 | 0.84 | 0.53 | 0.50 | 0.67 | 0.62 |
| 15 | Stacking Estimator | 1.00 | 0.91 | 1.00 | 0.70 | 0.98 | 0.80 | 0.99 | 0.75 |
# defining list of models
models1 = [dtree,bagging,bagging_wt,rf,rf_wt,dtree_estimator,bagging_estimator,rf_estimator]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models1:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_score_train.append(np.round(j[6],2))
f1_score_test.append(np.round(j[7],2))
comparison_frame1 = pd.DataFrame({'Model':['Decision Tree','Bagging','Bagging with class weights','Random Forest','Random Forest with weights','Decision Tree Tuned','Bagging Tuned','Random Forest Tuned'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_f1score':f1_score_train,'Test_f1score':f1_score_test})
comparison_frame1
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_f1score | Test_f1score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.00 | 0.88 | 1.00 | 0.68 | 1.00 | 0.67 | 1.00 | 0.68 |
| 1 | Bagging | 0.99 | 0.90 | 0.97 | 0.58 | 1.00 | 0.82 | 0.98 | 0.68 |
| 2 | Bagging with class weights | 0.99 | 0.90 | 0.97 | 0.54 | 1.00 | 0.88 | 0.99 | 0.67 |
| 3 | Random Forest | 1.00 | 0.90 | 1.00 | 0.50 | 1.00 | 0.94 | 1.00 | 0.65 |
| 4 | Random Forest with weights | 1.00 | 0.89 | 1.00 | 0.47 | 1.00 | 0.93 | 1.00 | 0.62 |
| 5 | Decision Tree Tuned | 0.77 | 0.77 | 0.69 | 0.69 | 0.43 | 0.44 | 0.53 | 0.54 |
| 6 | Bagging Tuned | 0.62 | 0.62 | 0.79 | 0.81 | 0.31 | 0.31 | 0.44 | 0.45 |
| 7 | Random Forest Tuned | 1.00 | 0.92 | 1.00 | 0.66 | 1.00 | 0.89 | 1.00 | 0.76 |
# defining list of models
models2 = [abc, abc_tuned, gbc, gbc_init, gbc_tuned, xgb, xgb_tuned, stacking_estimator]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_score_train = []
f1_score_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models2:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_score_train.append(np.round(j[6],2))
f1_score_test.append(np.round(j[7],2))
comparison_frame2 = pd.DataFrame({'Model':['AdaBoost with default parameters','AdaBoost Tuned',
'Gradient Boosting with default parameters','Gradient Boosting with init=Adaboost',
'Gradient Boosting Tuned','XGBoost with default parameters', 'XGBoost Tuned', 'Stacking Estimator'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_f1score':f1_score_train,'Test_f1score':f1_score_test})
comparison_frame2
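The metric-collection loop above is duplicated three times with only the model list and labels changing; it could be wrapped in a helper. A minimal sketch, assuming `get_metrics_score(model, False)` returns the eight scores in the order used above (the helper name and the generic `score_fn` parameter are my own):

```python
import numpy as np
import pandas as pd

METRIC_COLS = ["Train_Accuracy", "Test_Accuracy", "Train_Recall", "Test_Recall",
               "Train_Precision", "Test_Precision", "Train_f1score", "Test_f1score"]

def build_comparison_frame(models, names, score_fn):
    """One row per model: the eight scores returned by score_fn, rounded to 2 dp."""
    rows = [np.round(score_fn(m, False), 2) for m in models]
    frame = pd.DataFrame(rows, columns=METRIC_COLS)
    frame.insert(0, "Model", names)
    return frame
```

The three frames would then each be one call, e.g. `build_comparison_frame(models2, model_names2, get_metrics_score)` with a matching list of labels.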